feat: Add BijectionConverter and BijectionAttack (#1903) by sajisanchu1913-source · Pull Request #1942 · microsoft/PyRIT

sajisanchu1913-source · 2026-06-04T22:30:00Z

Summary

Implements the Bijection Attack from arXiv:2410.01294 (Haize Labs) into PyRIT.

The attack works by teaching a target LLM a secret character mapping through
demonstration shots, then sending harmful prompts encoded in that mapping to
bypass safety filters. Responses are decoded using the inverse mapping.

Changes

New Files

pyrit/prompt_converter/bijection_converter.py — generates random letter-to-letter mapping, encodes prompts, decodes responses
pyrit/executor/attack/single_turn/bijection_attack.py — runs full bijection attack with teaching phase
tests/unit/prompt_converter/test_bijection_converter.py — 11 unit tests for converter
tests/unit/executor/test_bijection_attack.py — 5 unit tests for attack
doc/code/executor/attack/bijection_attack.ipynb — usage notebook

Modified Files

pyrit/prompt_converter/__init__.py — registered BijectionConverter
pyrit/executor/attack/single_turn/__init__.py — registered BijectionAttack

How It Works

BijectionConverter generates a random secret mapping (e.g. a→q, b→x...)
BijectionAttack sends teaching messages to target AI to teach the mapping
Harmful prompt is encoded and sent as TASK is '⟪encoded prompt⟫'
Response is decoded using inverse mapping
Decoded response is scored by the judge

Pattern Followed

BijectionConverter follows FlipConverter pattern
BijectionAttack follows FlipAttack pattern

Reference

Haize Labs implementation: https://github.com/haizelabs/bijection-learning
Paper: arXiv:2410.01294
Closes FEAT Bijection #1903

…dup and harm categories

… fix imports and ordering

- _RemoteDatasetLoader._fetch_zip_from_url: - keyword-only args (source, inner_files, cache) - streams download (requests stream=True + iter_content) to avoid double-buffering large archives - md5-keyed disk cache under DB_DATA_PATH / seed-prompt-entries when cache=True; named temp file otherwise (cleaned up after parse) - validates each inner_files extension against FILE_TYPE_HANDLERS; raises ValueError with a member preview if an inner file is missing - parses inner files via FILE_TYPE_HANDLERS and returns parsed dicts, so the open ZipFile never escapes the worker thread - adds the missing import zipfile that broke the previous commit - _MICDataset: - drops unused io / json / requests imports (helper handles them) - delegates download + parse to the helper; only owns the seed construction loop - guards non-string Q values (in addition to NaN moral values) - forwards cache from fetch_dataset_async to the helper - factors authors into AUTHORS class constant - Tests: - test_moral_integrity_corpus_dataset.py: stops mocking requests.get directly; patches _fetch_zip_from_url to return parsed dicts so tests don't depend on the helper's internal shape - adds test_fetch_dataset_non_string_q and test_fetch_dataset_passes_cache_flag - hoists imports into the right groups so ruff I001 stops firing - removes trailing whitespace / extra newlines - test_remote_dataset_loader.py: adds TestFetchZipFromUrl covering happy path, on-disk caching (hits 1 network call across 2 fetches), cache=False does not persist, missing inner file raises ValueError, unsupported extension raises ValueError Verified live against the real MIC.zip: 35,408 unique seeds across all 6 moral foundations in ~2.4s cold / ~1.3s warm. All 559 dataset unit tests pass; ruff clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Use tempfile.NamedTemporaryFile instead of fixed temp_audio.wav to prevent concurrent call collisions - Wrap Azure upload in try/finally to ensure temp file is always deleted even when upload fails - Add regression test to verify cleanup on upload failure Fixes microsoft#1894

- Add BijectionConverter that generates random letter-to-letter mapping - Add BijectionAttack that teaches the mapping to target AI and encodes harmful prompts - Add unit tests for both converter and attack - Add notebook demonstrating usage - Update __init__.py files to register new classes Based on arXiv:2410.01294 (Haize Labs bijection-learning)

romanlutz

This is a great start! There are a few things that need addressing but we're pretty close.

- Remove @pytest.mark.asyncio decorators (asyncio_mode=auto) - Fix __init__.py alphabetical ordering for BijectionConverter - Use patch_central_database fixture in attack tests - Use MagicMock(spec=PromptTarget) instead of plain MagicMock - Remove dead num_digits parameter - Add BijectionType StrEnum for bijection_type validation - Use private attributes with underscore prefix - Add _build_identifier() method - Fix teaching shots cap with programmatic cycling - Fix alternating user/assistant roles in teaching messages - Fix response decoding in _perform_async - Add BijectionConverter to _request_converters pipeline - Fix notebook format and add paired .py jupytext file - Register BijectionAttack in executor/attack/__init__.py

sajisanchu1913-source · 2026-06-15T05:08:59Z

Hi @romanlutz I've addressed all the review comments:

Removed @pytest.mark.asyncio decorators
Fixed init.py alphabetical ordering
Used patch_central_database fixture in attack tests
Used MagicMock(spec=PromptTarget) instead of plain MagicMock
Removed dead num_digits parameter
Added BijectionType StrEnum for validation
Used private attributes with underscore prefix
Added _build_identifier() method
Fixed teaching shots cap with programmatic cycling
Fixed alternating user/assistant roles in teaching messages
Fixed response decoding in _perform_async
Added BijectionConverter to _request_converters pipeline
Fixed notebook format and added paired .py jupytext file

Ready for re-review!

…ifier import

sajisanchu1913-source · 2026-06-15T05:28:27Z

Hi @romanlutz I've addressed the remaining review comments:

Resolved merge conflicts with upstream/main (kept BidiConverter from main, added BijectionConverter alphabetically)
Added end-to-end test in TestBijectionAttackEndToEnd that uses MockPromptTarget, returns a cipher-text response, and asserts the result is decoded back to plain text
Fixed ComponentIdentifier import to use pyrit.models.identifiers

Ready for re-review

romanlutz · 2026-06-15T13:43:22Z

+        prompt_normalizer: Optional[PromptNormalizer] = None,
+        max_attempts_on_failure: int = 0,
+        num_teaching_shots: int = 5,
+        bijection_type: str = "letter",


Two minor cleanups:

bijection_type is typed as str on the attack but BijectionType on the converter. Line 42 should be bijection_type: BijectionType = BijectionType.LETTER so the public‑facing attack matches the converter's signature. The StrEnum still accepts the literal "letter" at runtime, but the type annotation lies as written.

Optional[X] instead of X | None. Lines 37–39 use Optional[AttackConverterConfig], Optional[AttackScoringConfig], Optional[PromptNormalizer]. The codebase enforces PEP 604 (X | None) via ruff UP007/UP045 — pre‑commit will catch these. While you're at it, line 6 can drop Optional from the typing import.

romanlutz · 2026-06-15T13:43:22Z

+        # decode the response if there is one
+        if result.last_response and result.last_response.original_value:
+            decoded = self._bijection_converter.decode(result.last_response.original_value)
+            result.last_response.original_value = decoded


Blocking — mutating result.last_response.original_value corrupts the audit trail.

result.last_response.original_value = decoded result.last_response.original_value = decoded

last_response is a reference to a Message that has already been written to CentralMemory by PromptSendingAttack._perform_async. This in‑place mutation overwrites the recorded target response so memory now shows the decoded plain‑English text as if the target had returned it directly — the actual cipher‑text response from the model is lost from the audit log.

For an attack whose entire purpose is to produce harmful content in obfuscated form, losing the real model output is a significant integrity problem: future runs can't be replayed, the cipher‑shape (which is the evidence the attack worked) is gone, and any downstream analysis sees only the post‑processed version.

The decoded value should be attached alongside the original, e.g.:

Add a converted MessagePiece (preferred — that's what the converter pipeline normally produces, and it's what response converters in the normalizer do automatically).

Or store the decoded text in AttackResult metadata (result.metadata["decoded_response"] = decoded) and leave original_value untouched.

Related: this is another argument for letting the converter pipeline handle decoding (via response converters on the normalizer) rather than doing it manually here — the pipeline already preserves the original and adds converted values without mutation.

romanlutz · 2026-06-15T13:43:22Z

+    def test_teaching_messages_contain_secret_code(self, mock_objective_target):
+        attack = BijectionAttack(objective_target=mock_objective_target)
+        messages = attack._build_teaching_messages()
+        assert "secret code" in str(messages[0]).lower()


Brittle assertion — "secret code" is a literal string from the intro message that's quite likely to get reworded (e.g., if you switch the mapping turn to a system prompt per the paper, the wording will almost certainly change). The test will then break for a reason unrelated to what it's actually trying to verify.

Better to assert structural properties:

the first message has role="user" (or "system" after the fix)

the message count matches 1 + 2 * num_teaching_shots (intro + shot pairs)

subsequent messages alternate roles

shots contain the encoded form of examples[i]

These would catch real regressions (e.g., the alternating‑roles fix being undone) instead of just minor prompt rewording.

romanlutz · 2026-06-15T13:43:22Z

+            messages.append(Message.from_prompt(
+                prompt=f"{encoded} = {original}. Got it!",
+                role="assistant"
+            ))


Blocking — assistant teaching turns are plaintext English, not in‑cipher.

This change addresses my earlier comment about alternating roles, but the actual content of the assistant turns defeats the purpose. The paper (§2) specifies "in‑context User‑Assistant shots, with User messages in English and Assistant messages in the corresponding bijection language 'translation'". The current code does the opposite:

# line 92-95: ACK in plaintext English "Understood! I will use this secret code in our conversation." # line 109-119: user sends cipher and asks for confirmation, # assistant replies in a half-cipher/half-English translation echo f"In our code '{encoded}' means '{original}'. Understood?" # user f"{encoded} = {original}. Got it!" # assistant

The mechanism that makes the attack work is the assistant fluently producing cipher output — that's what induces the cipher‑shaped response distribution at inference time. Plain‑English ACKs plus cipher = plain translation echoes look to the model like "the user is showing me a translation key," not "I should produce text in this language."

Per the paper, the shot pattern should be:

User (English): "the quick brown fox"

Assistant (cipher): "ekt cvpjl mryio gyx"

And the ACK turn isn't needed at all — the paper just uses 10 translation shots, no separate acknowledgment.

romanlutz · 2026-06-15T13:43:22Z

+    LETTER = "letter"
+
+
+class BijectionConverter(PromptConverter):


Restructure recommendation: abstract BijectionConverter + 3 concrete subclasses, attack takes a converter instance.

After rereading the paper, the current single-class + BijectionType StrEnum design doesn't scale to what the paper actually requires. §2 specifies three bijection types — permuted alphabet, ℓ‑digit numbers, and tokens from the target's tokenizer — and explicitly notes their complexity parameters (fixed_size, ℓ, vocab subset) are what give the attack its scale‑adaptive property. So implementing only LETTER understates what the attack claims to do.

Stuffing per‑mode params (num_digits for digits, tokenizer for tokens) onto a single class produces dead‑param footguns (BijectionConverter(bijection_type=DIGITS, fixed_size=5) would silently mix modes). Subclasses give honest signatures:

class BijectionConverter(PromptConverter, abc.ABC): def __init__(self, *, mapping: dict[str, str] | None = None, seed: int | None = None) -> None: rng = random.Random(seed) self._mapping = mapping if mapping is not None else self._generate_mapping(rng) self._inverse_mapping = {v: k for k, v in self._mapping.items()} @abc.abstractmethod def _generate_mapping(self, rng: random.Random) -> dict[str, str]: ... async def convert_async(...): ... # shared def decode(...): ... # shared def _build_identifier(...): ... # shared class LetterBijectionConverter(BijectionConverter): def __init__(self, *, fixed_size: int = 0, mapping=None, seed=None): ... class DigitBijectionConverter(BijectionConverter): def __init__(self, *, num_digits: int = 2, mapping=None, seed=None): ... class TokenBijectionConverter(BijectionConverter): def __init__(self, *, tokenizer, mapping=None, seed=None): ...

The base class gets two things that are needed now, not just for future modes:

seed — currently random.shuffle uses the global RNG with no way to reproduce a mapping. Red‑team work needs replay; a seed parameter (constructing a local random.Random(seed)) is the standard fix.

mapping — accept an explicit dict[str, str] so callers can replay a known successful mapping or run deterministic experiments. The test file currently works around the lack of this by reading converter.mapping after random generation, which is awkward.

BijectionAttack then simplifies dramatically — it just accepts a converter instance:

class BijectionAttack(PromptSendingAttack): def __init__( self, *, objective_target: PromptTarget = REQUIRED_VALUE, bijection_converter: BijectionConverter = REQUIRED_VALUE, # this could also be None and have the letter version as default num_teaching_shots: int = 10, ... ): ...

This drops bijection_type, fixed_size, and the type‑confused bijection_type: str = "letter" annotation. The user composes:

attack = BijectionAttack( objective_target=target, bijection_converter=DigitBijectionConverter(num_digits=2, seed=42), )

I'd push for all three modes in this PR — landing only LETTER and adding the rest later means either a breaking API change (when the inevitable bijection_type / per‑mode params get reshuffled) or sticking with the current sub‑optimal design. The architectural cost is paid once now; the alternative is paying it twice.

Acknowledging this is a bigger ask than my earlier comments — happy to discuss if you'd prefer to land LETTER only and follow up, but I think the restructure is worth it.

- Change Optional[X] to X | None (PEP 604) - Change bijection_type: str to BijectionType in attack - Register BijectionType in prompt_converter __init__.py - Store decoded response in metadata instead of mutating last_response - Fix teaching shots: user sends English, assistant responds in cipher - Fix brittle test assertions to check structural properties - Update end-to-end test to check metadata for decoded response

sajisanchu1913-source and others added 12 commits May 28, 2026 17:14

FEAT: Add SALT-NLP Moral Integrity Corpus (MIC) dataset loader

ff0843e

FEAT: Add SALT-NLP MIC dataset loader with tests and documentation

83dd517

REFACTOR: Rename to moral_integrity_corpus_dataset, fix async, add de…

abc1e16

…dup and harm categories

fix: address reviewer feedback - fix NaN crash, add liberty category,…

88f89f0

… fix imports and ordering

fix: correct import ordering and trailing newline

fedba1c

fix: add reusable _fetch_zip_from_url helper to base class

cf197d9

Merge branch 'main' into main

039e713

Merge branch 'microsoft:main' into main

010a439

fix: add missing newline at end of file

056e938

sajisanchu1913-source mentioned this pull request Jun 4, 2026

FEAT Bijection #1903

Open

romanlutz reviewed Jun 15, 2026

View reviewed changes

sajisanchu1913-source added 2 commits June 15, 2026 01:20

fix: resolve merge conflicts with upstream/main

1973122

fix: add end-to-end test for response decoding and fix ComponentIdent…

9f0ac6d

…ifier import

romanlutz reviewed Jun 15, 2026

View reviewed changes

Conversation

sajisanchu1913-source commented Jun 4, 2026

Summary

Changes

New Files

Modified Files

How It Works

Pattern Followed

Reference

Uh oh!

romanlutz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sajisanchu1913-source commented Jun 15, 2026

Uh oh!

sajisanchu1913-source commented Jun 15, 2026

Uh oh!

romanlutz Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

romanlutz Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

romanlutz Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

romanlutz Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

romanlutz Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants